Abstract



The ultimate goal of optimization is to find the minimizer of a target function. However, typical criteria 
for active optimization often ignore the uncertainty about the minimizer. We propose a novel criterion 
for global optimization and an associated sequential active learning strategy using Gaussian processes. 
Our criterion is the reduction of uncertainty in the posterior distribution of the function minimizer. It 
can also flexibly incorporate multiple global minimizers. We implement a tractable approximation of 
the criterion and demonstrate that it locates the global minimizer more accurately than conventional
Bayesian optimization criteria.

1 Introduction 

Exploring an unknown parameter space in search of a globally optimal solution can be quite costly. The
aim of active optimization is to choose carefully where to sample in order to reduce the number of sample
acquisitions and hence their cost.

While it is possible to learn the response surface $f$ and then search for the minimizer, doing so is typically
wasteful because not all regions of the response surface are of interest; we do not need the details of the
response surface in regions far from the optimum. Under the active Bayesian optimization framework, the goal
is to query the (potentially noisy) oracle as few times as possible while quickly gaining knowledge of the
minimizer $x^* = \arg\min_x f(x)$. Prior works mostly focus on finding $x^*$ by obtaining the function minimum
$f^*$ [2, 3, 4, 5, 6, 7, 8]. Such criteria drive the sampling procedure towards improving the estimate of $f^*$
and provide an estimate of $x^*$ only as a consequence. Since $f^*$ is unique while $x^*$ might not be, such
approaches often discard potential minimizers. In design problems with cost constraints, this could lead to
discarding viable and cost-effective solutions.

In this paper, we use an acquisition criterion that maximizes the information gain about the minimizer,
or equivalently minimizes the minimizer entropy (MME) [9, 10]. MME provides a balance between exploration
and exploitation that is tailored specifically for finding the minimizers of global optimization problems. As a
result, MME samples densely around potential minimizers and sparsely in the other regions of the input
space (Figs. 1 and 3). Furthermore, since a global map of potential minimizers is maintained, MME enables
us to obtain multiple global minimizers.

2 Optimization Framework 

We consider an optimization target that is a continuous real-valued function $f : \mathcal{X} \to \mathbb{R}$,
where $\mathcal{X} \subset \mathbb{R}^d$ is bounded. Furthermore, we assume $f$ has a unique minimizer
$x^* = \arg\min_{x \in \mathcal{X}} f(x)$ (an assumption relaxed later) and that each observation is noisy; i.e.,
$y \mid f, x \sim \mathcal{N}(f(x), \sigma^2)$. The objective of the optimization is to find the function's
minimizer $x^*$ and its corresponding minimum $f^* = \min_{x \in \mathcal{X}} f(x)$.

The Bayesian optimization framework has been proposed to arrive at an $\epsilon$-close solution in a
sub-exponential number of function evaluations on average. While it is possible to devise strategies
seeking the jointly optimal $n$ samples, the computational cost is often prohibitive; hence, a greedy
(or one-step lookahead) sequential approach is typically used, where the next sample is chosen according to
an acquisition criterion. Popular acquisition criteria are summarized in the table below:



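Criterion    | $x_{n+1} = \arg\max_{x \in \mathcal{X}}$ of  | Description
-------------|----------------------------------------------|------------------------------------------------
Kushner [5]  | $\Pr(f(x) < f^* - \epsilon)$                 | Samples the point with the highest probability
             |                                              | of lying below the current minimum estimate.
Mockus [3]   | $\mathbb{E}[(f^* - \epsilon - f(x))^+]$      | Samples the point with the largest expected
             |                                              | improvement over the current minimum estimate.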
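Both criteria in the table have closed forms when $f(x)$ has a Gaussian posterior. The sketch below is an illustration, not the cited authors' code: it assumes a posterior mean `mu` and standard deviation `sigma` at a candidate point are already available, writes the improvement for minimization as $(f^* - \epsilon - f(x))^+$, and uses function names chosen here for convenience.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probability_of_improvement(mu, sigma, f_best, eps=0.0):
    """Kushner-style criterion: Pr(f(x) < f_best - eps) for f(x) ~ N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No posterior uncertainty: the event is either certain or impossible.
        return 1.0 if mu < f_best - eps else 0.0
    return norm_cdf((f_best - eps - mu) / sigma)


def expected_improvement(mu, sigma, f_best, eps=0.0):
    """Mockus-style criterion: E[(f_best - eps - f(x))^+] for f(x) ~ N(mu, sigma^2)."""
    gap = f_best - eps - mu
    if sigma <= 0.0:
        return max(gap, 0.0)
    z = gap / sigma
    return gap * norm_cdf(z) + sigma * norm_pdf(z)
```

Both functions score a single candidate; the next sample would be the candidate with the largest score. Neither quantity depends on where other near-optimal points lie, which is the limitation the minimizer-entropy criterion is meant to address.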



3 Proposed Acquisition Criterion: The MME Criterion 

The Bayesian framework, when applied to functional estimation, defines a prior $p(f)$ over the functions $f$
and a corresponding posterior $p(f \mid \mathcal{D}_n)$ after $n$ observations. Statistical inference on $x^*$
requires the posterior $p(x^* \mid \mathcal{D}_n)$. The minimizer $x^*$ relates deterministically to $f$ through
the highly nonlinear "argmin" operation; hence, it is intractable to compute $p(x^* \mid \mathcal{D}_n)$ from the
posterior $p(f \mid \mathcal{D}_n)$. In particular, consider the set of points for which the function values are
close to the optimum, $A = \{x \in \mathcal{X} : f^* + \epsilon > f(x) \mid \mathcal{D}_n\}$ for a small
$\epsilon > 0$. $A$ may not be localized even for a smooth true $f$, since the function could have multiple
disjoint $\epsilon$-close optimum regions (possibly due to multiple optimizers), making the minimizer
distribution $x^* \mid \mathcal{D}_n$ often quite complex. Therefore, in this paper, we propose utilizing the
inference on $f \mid \mathcal{D}_n$ as an intermediate step to learn $x^* \mid \mathcal{D}_n$ more efficiently by
focusing the sampling on the regions of $\mathcal{X}$ that contribute the most information about the minimizer.

Let $x^*_n = x^* \mid \mathcal{D}_n$ be the random variable representing the minimizer conditioned on $n$
observations. Our proposed criterion, MME, minimizes the minimizer entropy $H(x^*_n)$, where $H(\cdot)$ denotes
the entropy functional. In this paper, we focus on a sequential sampling scheme where we seek the next point
$x_{n+1}$ that minimizes the entropy of the minimizer given the additional sample $(x_{n+1}, y_{n+1})$. Thus,
the next sample point is given by:

$$\arg\min_{x_{n+1}} H(x^*_{n+1}) = \arg\min_{x_{n+1}} \mathbb{E}_{y_{n+1}}\left[H\left(x^* \mid \mathcal{D}_n, (x_{n+1}, y_{n+1})\right)\right]. \tag{1}$$

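A straightforward evaluation of (1) requires the computation of

$$p(x^* \mid \mathcal{D}_{n+1}) = \int p(x^* \mid f)\, p(f \mid \mathcal{D}_{n+1})\, df = \int \delta\left(x^* - \arg\min_x f(x)\right) p(f \mid \mathcal{D}_{n+1})\, df. \tag{2}$$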
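On a finite grid, the integral in (2) can be read as "the fraction of posterior function draws whose argmin falls at each grid point". The following minimal sketch illustrates that reading only. The helper name `minimizer_distribution` is chosen here, and the demonstration draws use independent Gaussian noise around a fixed mean vector as a stand-in for joint GP posterior samples, which in practice would be correlated across grid points.

```python
import random


def minimizer_distribution(function_draws):
    """Empirical p(x* = x_i | D): share of draws whose argmin is grid index i.

    function_draws: list of equally long lists, each one sampled function
    evaluated on the same grid.
    """
    n_grid = len(function_draws[0])
    counts = [0] * n_grid
    for draw in function_draws:
        i_min = min(range(n_grid), key=lambda i: draw[i])  # first index on ties
        counts[i_min] += 1
    total = len(function_draws)
    return [c / total for c in counts]


# Stand-in draws (assumption): two equally deep wells at grid indices 1 and 3.
rng = random.Random(0)
mean = [0.0, -1.0, 0.5, -1.0, 0.2]
draws = [[m + rng.gauss(0.0, 0.3) for m in mean] for _ in range(2000)]
p_hat = minimizer_distribution(draws)
```

With two wells of equal depth, the estimated distribution places roughly half of its mass on each, which is the multi-modal behaviour the text describes for functions with several global minimizers.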



Since direct evaluation of (2) is intractable in general, we develop a more tractable approximation. In
this paper, we utilize the widely used Gaussian process framework for $f \mid \mathcal{D}_{n+1}$. The minimizer's
posterior in (2) can be bounded pointwise as follows:

$$p(x^* = x \mid \mathcal{D}_{n+1}) = p\left(f(x) \le f(x'),\ \forall x' \in \mathcal{X} \mid \mathcal{D}_{n+1}\right) \le p\left(f(x) \le f^* \mid \mathcal{D}_{n+1}\right) \tag{3}$$

where $f^* = f(x^*)$ is our current estimate of the minimum. The equality results from the definition of the
minimizer, while the inequality is due to the fact that $f(x) \le f(x'),\ \forall x' \in \mathcal{X}$ implies
$f(x) \le f(x^*)$. The upper bound in (3) is equal to $g_{n+1}(x)$, where

$$g_n(x) = p\left(f(x) \le f(x^*) \mid \mathcal{D}_n\right) = \Phi\left(\frac{\mathbb{E}[f(x^*)] - \mathbb{E}[f(x)]}{\sqrt{\operatorname{var}[f(x^*)] + \operatorname{var}[f(x)] - 2\operatorname{cov}[f(x), f(x^*)]}}\right) \tag{4}$$

where $\Phi$ denotes the standard normal cdf, and the posterior means, variances, and covariances can be found
in [11]. Finally, we normalize $g_n(x)$ and use it as a proxy $\hat{p}_n(x) \approx p(x^* = x \mid \mathcal{D}_n)$.
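The closed form for $g_n(x)$ follows from the fact that a difference of jointly Gaussian variables is Gaussian. The short derivation below is a sketch under one assumption made here: $x^*$ is treated as a fixed location (the current minimizer estimate), so that $f(x)$ and $f(x^*)$ are jointly Gaussian under the GP posterior. The symbols $\Delta$, $m$, and $s$ are introduced only for this derivation.

```latex
\begin{align*}
\Delta(x) &= f(x) - f(x^*), \\
\Delta(x) \mid \mathcal{D}_n &\sim \mathcal{N}\big(m(x),\, s^2(x)\big), \\
m(x) &= \mathbb{E}[f(x)] - \mathbb{E}[f(x^*)], \\
s^2(x) &= \operatorname{var}[f(x)] + \operatorname{var}[f(x^*)]
          - 2\operatorname{cov}[f(x), f(x^*)], \\
g_n(x) &= \Pr\big(\Delta(x) \le 0 \mid \mathcal{D}_n\big)
        = \Phi\!\left(-\frac{m(x)}{s(x)}\right)
        = \Phi\!\left(\frac{\mathbb{E}[f(x^*)] - \mathbb{E}[f(x)]}{s(x)}\right).
\end{align*}
```

All expectations, variances, and covariances are taken under the posterior given $\mathcal{D}_n$.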

Notice that the approximation step leads to a broader distribution than the true posterior
$x^* \mid \mathcal{D}_n$; hence, the resulting entropy is an upper bound as well. Moreover, in the noiseless
case, the two distributions converge (for functions with a unique minimizer) to a delta function at the
minimizer as the posterior $f \mid \mathcal{D}_n$ approaches the true $f$. A natural advantage of this
approximation is that it generalizes to functions with multiple global minimizers through the multi-modal
proxy $\hat{p}_n(x)$ (see Figure 1). However, having multiple minimizers leads to ambiguity in the posterior
covariance term, $\operatorname{cov}[f(x), f(x^*)]$, in (4). In this case, we treat $f(x)$ and $f(x^*)$ as
independent and remove the covariance term from (4).
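As a rough illustration of the normalized proxy, including the option of dropping the covariance term, the sketch below evaluates the Gaussian comparison probability on a grid. It is not the authors' implementation. It assumes the posterior mean vector and covariance matrix on the grid are given, takes the current minimizer estimate to be the grid point with the smallest posterior mean, and uses a convention chosen here (value 0.5) when the difference has zero variance. The name `minimizer_proxy` and the flag `use_cov` are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def minimizer_proxy(mean, cov, use_cov=True):
    """Normalized g_n(x) on a grid.

    mean: posterior means at the grid points.
    cov:  posterior covariance matrix (list of lists) on the same grid.
    use_cov=False treats f(x) and f(x*) as independent (covariance term dropped).
    """
    n = len(mean)
    # Assumption: current minimizer estimate = grid point with smallest posterior mean.
    i_star = min(range(n), key=lambda i: mean[i])
    g = []
    for i in range(n):
        var = cov[i][i] + cov[i_star][i_star]
        if use_cov:
            var -= 2.0 * cov[i][i_star]
        if var <= 1e-12:
            # Degenerate difference (e.g. x equal to x*): convention chosen here.
            g.append(0.5 if mean[i] <= mean[i_star] else 0.0)
        else:
            g.append(norm_cdf((mean[i_star] - mean[i]) / math.sqrt(var)))
    total = sum(g)  # at least 0.5, since the entry at i_star is 0.5
    return [gi / total for gi in g]
```

With `use_cov=False` and two equally low posterior means, both locations receive the same proxy mass, so the proxy stays bimodal rather than committing to one of the minimizers.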

Figure 1: Illustration of the minimizer distribution of $f(x) = (-e^{-x^2} + 1)\cos(3\pi x)$ on $[-1.5, 1.5]$
with additive Gaussian noise (variance $(0.1)^2$). (Top) True target function $f$ and the posterior distribution
estimated via GP from 20 samples using MME. (Bottom) Minimizer distribution estimated from random samples of
the GP, and its approximation by (4).

Implementation: For each sample acquisition, we have to estimate (1). This requires an expectation
over $y_{n+1}$ for each candidate $x_{n+1}$; we use a Monte Carlo approach, sampling
$y_{n+1} \mid \mathcal{D}_n, x_{n+1}$ under the prior. Given each $y_{n+1}$, we use the approximation (4) and
evaluate the entropy in (1). This is done for each candidate on a grid, and the candidate that minimizes the
criterion is chosen. An alternative to sampling $y_{n+1}$, which can be costly, is to further approximate the
expected posterior entropy by assuming that the posterior mean function remains constant; we refer to the
algorithm with this extra assumption as the "fast" version.
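The selection step described above can be sketched as a loop over grid candidates with a Monte Carlo average of the proxy entropy. This is a skeleton under stated assumptions rather than the authors' code: the two callbacks are hypothetical placeholders, where `sample_y(x, rng)` is assumed to draw $y_{n+1}$ given the data and the candidate, and `proxy_after(x, y)` is assumed to return the normalized minimizer proxy on the grid after adding $(x, y)$ to the data.

```python
import math
import random


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution; zero entries are skipped."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def select_next(candidates, sample_y, proxy_after, n_mc=30, seed=0):
    """Return (candidate, score) minimizing the Monte Carlo expected proxy entropy.

    sample_y and proxy_after are hypothetical user-supplied callbacks
    (see the lead-in); n_mc is the number of simulated observations per candidate.
    """
    rng = random.Random(seed)
    best_x, best_h = None, float("inf")
    for x in candidates:
        total = 0.0
        for _ in range(n_mc):
            y = sample_y(x, rng)                  # simulated observation at x
            total += entropy(proxy_after(x, y))   # entropy after the update
        h = total / n_mc
        if h < best_h:
            best_x, best_h = x, h
    return best_x, best_h
```

Under the "fast" variant, the posterior mean is held fixed, and the GP posterior covariance does not depend on the observed value, so `proxy_after` would not depend on `y` and a single evaluation per candidate (`n_mc=1`) would be enough in this sketch.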

GP requires the selection of a kernel and associated hyperparameters. Choosing good hyperparameters is
critical for good small-sample performance and for convergence to the global solution. After acquiring each
sample, we use evidence optimization to infer the hyperparameters (including $\sigma^2$) [11, 12]. For the
examples in the results section, we used an isotropic squared exponential kernel with two hyperparameters [11]
and a constant mean function (one hyperparameter).
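For concreteness, a squared exponential kernel with two hyperparameters can be written as below. The parameterization (a signal variance and a single length scale shared by all input dimensions) is the common one and is assumed here, as are the names `se_kernel` and `gram_matrix`; the observation noise variance is added on the diagonal of the Gram matrix.

```python
import math


def se_kernel(x1, x2, signal_var=1.0, length_scale=1.0):
    """k(x1, x2) = signal_var * exp(-||x1 - x2||^2 / (2 * length_scale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return signal_var * math.exp(-0.5 * sq_dist / length_scale ** 2)


def gram_matrix(points, signal_var=1.0, length_scale=1.0, noise_var=0.0):
    """Kernel matrix over the points, with noise variance added on the diagonal."""
    n = len(points)
    return [
        [
            se_kernel(points[i], points[j], signal_var, length_scale)
            + (noise_var if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
```

Evidence optimization would then adjust `signal_var`, `length_scale`, `noise_var`, and the constant mean to maximize the marginal likelihood of the observations; that step is not shown here.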

4 Results 

4.1 1D toy example

To illustrate the main ideas, we demonstrate the algorithm on a 1D function under additive Gaussian noise
(Figure 1). The 1D function has two global minima and two local minima; therefore, the minimizer
distribution is multi-modal. Figure 1 shows the sampling distribution of the minimizer and our approximation of
it. As expected, it has two peaks corresponding to the two global minima. Furthermore, Figure 2 shows the
evolution of the minimizer's posterior and its convergence to the sharp bimodal form given in Figure 1.
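Convergence of a discretized minimizer posterior towards a bimodal target can be quantified with the Kullback-Leibler divergence, as Figure 2 reports. The sketch below is illustrative only: the direction KL(true || posterior), the flooring of small posterior values, and the name `kl_divergence` are assumptions made here, since the text does not specify them.

```python
import math


def kl_divergence(p, q, floor=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for distributions on a shared grid.

    Entries of q are floored to avoid division by zero; terms with p_i = 0 are skipped.
    """
    return sum(
        pi * math.log(pi / max(qi, floor))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Toy example: a true measure with two equal peaks versus two candidate posteriors.
true_measure = [0.0, 0.5, 0.0, 0.5, 0.0]
diffuse = [0.2, 0.2, 0.2, 0.2, 0.2]
sharp = [0.02, 0.47, 0.02, 0.47, 0.02]
kl_diffuse = kl_divergence(true_measure, diffuse)
kl_sharp = kl_divergence(true_measure, sharp)
```

In this toy case the sharper bimodal posterior has the smaller divergence from the two-peak target, which is the qualitative trend one would expect as observations accumulate.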

4.2 2D examples 

We compare our criterion against the popular criterion proposed by Mockus [3], also called Maximum Expected
Improvement (MEI), on two 2D test functions with a noise variance of $(0.1)^2$: the Hosaki function (1 local, 1
global minimum) [13], and the Dixon-Szego 6-hump camel test function (2 local, 2 global minima) [14].
In addition, to illustrate the effectiveness of the Bayesian optimization framework, we compare against the
state-of-the-art active response surface method proposed by Krause et al. [15]. All algorithms were applied
under the same prior and hyperparameter selection procedure, with the only difference being the acquisition
criterion. Both MEI and the response surface approach require a relatively good initial estimate of the
hyperparameters; therefore, we initialize them with 10 random samples. Also, since the response surface
method only works when the candidate set is finite, we restrict the sampling to a 15 x 15 grid for
all algorithms. Fig. 3A shows the convergence for the Hosaki function in terms of the median of the estimated
minimum function values obtained by each method over the 20 repetitions. Both MME and MEI performed
well, while the response surface method converged more slowly and underestimated the correct minimum
value due to the inaccurate estimation of $\sigma^2$. For the Dixon-Szego test function, MEI consistently drew
samples (black dots in Fig. 3B) near one of the global minimizers, and thus failed to find the other minimum.
The response surface method drew samples from all over the space and found the minimizers correctly, but
the values of the minima were not as accurate. MME, on the other hand, found the correct minimizers and
more accurate minimum values than the other two methods. The estimated minimum values at the global
minimizers are shown in the table below.

Figure 2: Convergence of the posterior minimizer distribution. The target function is identical to that of
Fig. 1, without noise. A: Evolution of the posterior distribution with the number of observations. Each
vertical slice shows the density in color (transformed to enhance visualization). Note that it converges to
two peaks corresponding to the true minimizers. B: KL-divergence between the true minimizer measure and the
posterior minimizer distribution, and the entropy of the minimizer distribution (top). Estimated function
minima (bottom). Each trace is a median of 41 Monte Carlo runs.





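                   | true   | M      | K      | MME
-------------------|--------|--------|--------|--------
Global minimizer 1 | -0.999 | 1.185  | -1.202 | -0.960
Global minimizer 2 | -0.999 | -0.975 | -0.791 | -1.020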



5 Discussion 

We proposed an information-theoretic active optimization criterion that focuses on learning the minimizer
distribution. The problem of optimizing an unknown function is transformed into minimizing the estimated
entropy of the minimizer, obtained via a Gaussian process. We plan to improve the computational complexity
and approximation accuracy in future work.
for Dixon-Szego 6-hump camel function with two global minima (red dots).